About




AVIVA: insuring people since 1696

CUSTOMERS: ca. 15M in the UK alone

QUANTUM: 600+ data practitioners globally

CUSTOMER SCIENCE: “Know customers, take better actions”

“Clickstream” is an important data source for any online business



The User Agent String (UAS) is an HTTP request header that describes the software acting on the user’s behalf
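A concrete (illustrative) example: even simple regular expressions can pull structured tokens out of a UAS. The UAS below is representative but made up for this sketch:

```r
# A representative (made-up) User Agent String:
uas <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"

# Simple regular expressions recover useful tokens:
os      <- regmatches(uas, regexpr("Windows NT [0-9.]+", uas))
browser <- regmatches(uas, regexpr("Chrome/[0-9.]+", uas))

os
## [1] "Windows NT 10.0"
browser
## [1] "Chrome/96.0.4664.110"
```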







Original purpose of UAS was “content negotiation” with servers

UAS contain many useful data points




UAS have use cases beyond content negotiation




UAS contain useful proxies of user characteristics





Oftentimes, parsing and one-hot encoding won’t be enough to create such features
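To see why, consider a minimal (hypothetical) sketch: one-hot encoding of parsed tokens produces one column per distinct value, so the matrix grows with the vocabulary and has no column at all for tokens unseen at training time:

```r
# Hypothetical mini-example: one-hot encoding parsed browser tokens
df <- data.frame(browser = c("Chrome/96", "Firefox/95", "Chrome/97"))

# `model.matrix()` one-hot encodes the factor (the `- 1` drops the intercept):
one_hot <- model.matrix(~ browser - 1, df)
dim(one_hot)
## [1] 3 3

# One column per distinct token: a value unseen during training
# (say "Edge/96") gets no representation at all.
```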


Let’s embed UAS into a low-dimensional space!

  • Can be done in a variety of ways

  • fastText (Bojanowski et al. 2016) is one particularly useful algorithm:

    • not data-hungry
    • works well on short documents
    • fast to train
    • out-of-vocabulary words are not a problem (thanks to subword n-grams)

Using the official fasttext Python library in R is easy, thanks to reticulate

# Install `fasttext` first (see https://fasttext.cc/docs/en/support.html)

# Load the `reticulate` package
require(reticulate)

# Make sure `fasttext` is available to R:
py_module_available("fasttext") 
## [1] TRUE

# Load `fasttext`:
ft <- import("fasttext")

# Then call the required methods using the `$` notation, e.g.: `ft$train_supervised`

fastText transformers can be trained in both unsupervised and supervised modes


Example dataset

A sample of 200,000 unique UAS from the whatismybrowser.com database



Unsupervised training of a fastText transformer


m_unsup <- ft$train_unsupervised(input = "./data/train_data_unsup.txt",
                                 model = "skipgram",
                                 lr = 0.05, 
                                 dim = 32L, # vector dimension
                                 ws = 3L, 
                                 minCount = 1L,
                                 minn = 2L, 
                                 maxn = 6L, 
                                 neg = 3L, 
                                 wordNgrams = 2L, 
                                 loss = "ns",
                                 epoch = 100L, 
                                 thread = 10L)

Getting the UAS vector representations

# Requires `dplyr` for the pipe `%>%` and `bind_rows()`
library(dplyr)

test_data <- readLines("./data/test_data_unsup.txt")

emb_unsup <- test_data %>% 
  lapply(., function(x) {
    m_unsup$get_sentence_vector(text = x) %>% # returns the average word vector for a UAS
      t(.) %>% as.data.frame(.)
  }) %>% 
  bind_rows(.) %>% 
  setNames(., paste0("f", 1:32))

emb_unsup[1:3, 1:10]
##      f1       f2    f3    f4      f5     f6      f7    f8     f9    f10
## 1 0.197 -0.03726 0.147 0.153  0.0423 0.0488  0.0196 0.132 0.1946  0.186
## 2 0.182  0.00307 0.147 0.101  0.0326 0.0847 -0.0174 0.108 0.1957  0.171
## 3 0.101 -0.28220 0.189 0.202 -0.1623 0.2622  0.1386 0.106 0.0733 -0.035

The resultant embeddings are quite useful
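For example, similar UAS end up close together in the embedding space, so simple vector arithmetic becomes meaningful. A minimal cosine-similarity helper (vector values made up for illustration and truncated to three dimensions):

```r
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Toy values loosely resembling the first entries of two UAS embeddings:
v1 <- c(0.197, -0.037, 0.147)
v2 <- c(0.182,  0.003, 0.147)

cosine(v1, v2) # close to 1, i.e. the two UAS look alike
```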

Adding labels to data
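fastText’s supervised mode expects one document per line, with each label prefixed by `__label__` (the library’s default prefix). A hypothetical sketch of preparing such a file in R, assuming a device-type label per UAS:

```r
# Hypothetical (truncated) UAS and device-type labels:
uas    <- c("Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) ...",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...")
labels <- c("mobile", "desktop")

train_lines <- paste0("__label__", labels, " ", uas)
train_lines[1]
## [1] "__label__mobile Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) ..."

# writeLines(train_lines, "./data/train_data_sup.txt")
```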



Supervised training of a fastText transformer


m_sup <- ft$train_supervised(input = "./data/train_data_sup.txt",
                             lr = 0.05, 
                             dim = 32L, # vector dimension
                             ws = 3L, 
                             minCount = 1L,
                             minCountLabel = 10L, # min label occurrence
                             minn = 2L, 
                             maxn = 6L, 
                             neg = 3L, 
                             wordNgrams = 2L, 
                             loss = "softmax", # loss function
                             epoch = 100L, 
                             thread = 10L)

The resultant embeddings are even better!

Take-home messages